827 research outputs found
A Fast Algorithm for Robust Regression with Penalised Trimmed Squares
The presence of groups containing high leverage outliers makes linear
regression a difficult problem due to the masking effect. The available high
breakdown estimators based on Least Trimmed Squares often do not succeed in
detecting masked high leverage outliers in finite samples.
An alternative to the LTS estimator, called Penalised Trimmed Squares (PTS)
estimator, was introduced by the authors in \cite{ZiouAv:05,ZiAvPi:07} and it
appears to be less sensitive to the masking problem. This estimator is defined
by a Quadratic Mixed Integer Programming (QMIP) problem, where in the objective
function a penalty cost for each observation is included which serves as an
upper bound on the residual error for any feasible regression line. Since the
PTS does not require presetting the number of outliers to delete from the data
set, it has better efficiency with respect to other estimators. However, due to
the high computational complexity of the resulting QMIP problem, exact
solutions for moderately large regression problems are infeasible.
In this paper we further establish the theoretical properties of the PTS
estimator, such as high breakdown and efficiency, and propose an approximate
algorithm called Fast-PTS to compute the PTS estimator for large data sets
efficiently. Extensive computational experiments on sets of benchmark instances
with varying degrees of outlier contamination indicate that the proposed
algorithm performs well in identifying groups of high leverage outliers in
reasonable computational time. Comment: 27 pages
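As a reading aid, the kind of objective this abstract describes can be sketched as follows. The notation ($\delta_i$, $p_i$, $r_i$) is assumed here for illustration and is not taken from the paper; the exact form of the penalties is given in the cited works.

```latex
% Sketch of a penalised trimmed squares objective (notation assumed).
% delta_i = 1 marks observation i as deleted, at penalty cost p_i.
\min_{\beta,\ \delta \in \{0,1\}^n} \;
  \sum_{i=1}^{n} \Bigl[ (1-\delta_i)\, r_i(\beta)^2 + \delta_i\, p_i \Bigr],
\qquad r_i(\beta) = y_i - x_i^{\top}\beta .

% For fixed beta the best choice is delta_i = 1 exactly when r_i(beta)^2 > p_i,
% so the objective equals
\sum_{i=1}^{n} \min\bigl( r_i(\beta)^2,\ p_i \bigr).
```

In this form each penalty caps the contribution of its observation, and the number of deleted observations follows from the fit rather than being fixed in advance.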
Graph-Embedding Empowered Entity Retrieval
In this research, we improve upon the current state of the art in entity
retrieval by re-ranking the result list using graph embeddings. The paper shows
that graph embeddings are useful for entity-oriented search tasks. We
demonstrate empirically that encoding information from the knowledge graph into
(graph) embeddings contributes to a higher increase in effectiveness of entity
retrieval results than using plain word embeddings. We analyze the impact of
the accuracy of the entity linker on the overall retrieval effectiveness. Our
analysis further deploys the cluster hypothesis to explain the observed
advantages of graph embeddings over the more widely used word embeddings, for
user tasks involving ranking entities.
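The re-ranking idea can be illustrated with a minimal sketch. It assumes that each candidate entity has a first-stage retrieval score, that the query has been linked to entities with confidence values, and that the two signals are combined by linear interpolation. The function names, the weight `lam`, and the interpolation itself are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def rerank(candidates, query_entities, embeddings, lam=0.5):
    """Re-rank candidates by mixing a base score with embedding similarity.

    candidates:     list of (entity_id, base_score) from a first-stage ranker
    query_entities: list of (entity_id, confidence) from an entity linker
    embeddings:     dict mapping entity_id to a NumPy vector (hypothetical input)
    lam:            interpolation weight (assumed; 0 keeps the base ranking)
    """
    rescored = []
    for entity_id, base_score in candidates:
        # Confidence-weighted similarity to the entities linked in the query.
        emb_score = sum(
            conf * cosine(embeddings[entity_id], embeddings[q])
            for q, conf in query_entities
        )
        rescored.append((entity_id, (1 - lam) * base_score + lam * emb_score))
    return sorted(rescored, key=lambda item: item[1], reverse=True)
```

With `lam=0` the first-stage ranking is returned unchanged, which makes it easy to check how much the embedding signal contributes.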
Crowdsourcing Dialect Characterization through Twitter
We perform a large-scale analysis of language diatopic variation using
geotagged microblogging datasets. By collecting all Twitter messages written in
Spanish over more than two years, we build a corpus from which a carefully
selected list of concepts allows us to characterize Spanish varieties on a
global scale. A cluster analysis proves the existence of well defined
macroregions sharing common lexical properties. Remarkably enough, we find that
the Spanish language is split into two superdialects, namely, an urban speech used
across major American and Spanish cities and a diverse form that encompasses
rural areas and small towns. The latter can be further clustered into smaller
varieties with a stronger regional character. Comment: 10 pages, 5 figures
Robust artificial neural networks and outlier detection. Technical report
Large outliers break down linear and nonlinear regression models. Robust
regression methods allow one to filter out the outliers when building a model.
By replacing the traditional least squares criterion with the least trimmed
squares criterion, in which half of the data is treated as potential outliers, one
can fit accurate regression models to strongly contaminated data.
High-breakdown methods have become very well established in linear regression,
but have started being applied for non-linear regression only recently. In this
work, we examine the problem of fitting artificial neural networks to
contaminated data using the least trimmed squares criterion. We introduce a
penalized least trimmed squares criterion which prevents unnecessary removal of
valid data. Training of ANNs leads to a challenging non-smooth global
optimization problem. We compare the efficiency of several derivative-free
optimization methods in solving it, and show that our approach identifies the
outliers correctly when ANNs are used for nonlinear regression.
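The least trimmed squares criterion mentioned above can be shown in the simpler linear setting. The sketch below is not the report's method, which trains neural networks with derivative-free optimisers. It uses random elemental starts followed by concentration steps, a common heuristic for linear LTS. The function names and default parameters are assumptions made for illustration.

```python
import numpy as np

def lts_objective(residuals, h):
    """Sum of the h smallest squared residuals (the trimmed criterion)."""
    r2 = np.sort(residuals ** 2)
    return float(r2[:h].sum())

def lts_fit(X, y, h=None, n_starts=50, n_csteps=20, seed=0):
    """Approximate linear LTS fit via random starts and concentration steps."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if h is None:
        h = (n + p + 1) // 2  # keep roughly half of the observations
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        # Start from a least squares fit to p randomly chosen observations.
        subset = rng.choice(n, size=p, replace=False)
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        for _ in range(n_csteps):
            # Concentration step: refit on the h best-fitting observations.
            r2 = (y - X @ beta) ** 2
            keep = np.argsort(r2)[:h]
            beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        obj = lts_objective(y - X @ beta, h)
        if obj < best_obj:
            best_beta, best_obj = beta, obj
    return best_beta, best_obj
```

Observations whose squared residual under the returned fit is large relative to the retained ones are the candidate outliers. A penalised variant would additionally charge a cost for each discarded observation, so that valid data is not trimmed unnecessarily.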
Robust Fuzzy Clustering via Trimming and Constraints
A methodology for robust fuzzy clustering is proposed. This
methodology can be widely applied in very different statistical problems given
that it is based on probability likelihoods. Robustness is achieved by trimming
a fixed proportion of “most outlying” observations which are indeed
self-determined by the data set at hand. Constraints on the clusters’ scatters
are also needed to get mathematically well-defined problems and to avoid the
detection of non-interesting spurious clusters. The main lines for computationally
feasible algorithms are provided and some simple guidelines about
how to choose tuning parameters are briefly outlined. The proposed methodology
is illustrated through two applications. The first one is aimed at heterogeneous
clustering under multivariate normal assumptions and the second
one might be useful in fuzzy clusterwise linear regression problems.
Ministerio de Economía, Industria y Competitividad (MTM2014-56235-C2-1-P); Junta de Castilla y León (research project support programme, Ref. VA212U13)
Combining semantic web technologies with evolving fuzzy classifier eClass for EHR-based phenotyping : a feasibility study
In parallel to nation-wide efforts for setting up shared electronic health records (EHRs) across healthcare settings, several large-scale national and international projects are developing, validating, and deploying EHR-oriented phenotype algorithms that aim at large-scale use of EHR data for genomic studies. A current bottleneck in using EHR data to obtain computable phenotypes is transforming the raw EHR data into clinically relevant features. The research study presented here proposes a novel combination of Semantic Web technologies with the on-line evolving fuzzy classifier eClass to obtain and validate EHR-driven computable phenotypes derived from 1956 clinical statements from EHRs. The evaluation performed with clinicians demonstrates the feasibility and practical acceptability of the proposed approach.
Dynamic clustering of time series with Echo State Networks
In this paper we introduce a novel methodology for unsupervised analysis of time series, based upon the iterative implementation of a clustering algorithm embedded into the evolution of a recurrent Echo State Network. The main features of the temporal data are captured by the dynamical evolution of the network states, which are then subject to a clustering procedure. We apply the proposed algorithm to time series coming from records of eye movements, called saccades, which are recorded for diagnosis of a neurodegenerative form of ataxia. This is a hard classification problem, since saccades from patients at an early stage of the disease are practically indistinguishable from those coming from healthy subjects. The unsupervised clustering algorithm implanted within the recurrent network produces more compact clusters, compared to conventional clustering of static data, and provides a source of information that could aid diagnosis and assessment of the disease.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
Sparse Robust Regression for Explaining Classifiers
Recipient of the best student paper award. Peer reviewed
Towards Spatial Word Embeddings
Leveraging textual and spatial data provided in spatio-textual objects (e.g., tweets) has become increasingly important in real-world applications, favoured by their increasing availability in recent decades (e.g., through smartphones). In this paper, we propose a spatial retrofitting method of word embeddings that could reveal the localised similarity of word pairs as well as the diversity of their localised meanings. Experiments based on the semantic location prediction task show that our method achieves significant improvement over strong baselines.